Boston is a cold city. The weather in Boston affects the city in many ways; among these, we are mainly interested in the relationship between weather and 311 reports.
Before we can ask any questions of our data, we need to put it in a form conducive to analysis. We start by loading each dataset and cleaning up the columns that will be used for merging with other datasets. We also do any computations that are most easily done before merging.
# All imports go in this block
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from re import sub
from decimal import Decimal
from wordcloud import WordCloud, STOPWORDS
# Load the weather dataset
df_weather = pd.read_csv("data/Boston weather_clean.csv")
For the purposes of answering questions about snow in Boston, it would be helpful to know a few things about how likely it is that snow is around.
First, it would be helpful if each day were labelled by whether it was the first day of snowfall or the third day of a long snowstorm. For this reason, we add the consecutive_snow_days column.
Next, for days on which no snow was reported, it would be nice to know whether there had recently been snowfall, which could still affect 311 requests on that day. For this reason, we add the days_since_last_snow column.
Lastly, in case we want to look at snow accumulated over several days, we keep an accumulated_snow column, which sums the snowfall in inches over consecutive snow days.
# Compute snow-related metrics
consecutive_snow_days = [0]
days_since_last_snow = [365]
accumulated_snow = [0]
for index, row in df_weather.iterrows():
    if row['Events'] == 'Snow' or row['Events'] == 'Both' or row['Snowfall (in)'] > 0:
        if consecutive_snow_days[-1] == 0:
            # First day of a new snow streak: start a fresh accumulation
            accumulated_snow.append(float(row['Snowfall (in)']))
        else:
            accumulated_snow.append(accumulated_snow[-1] + row['Snowfall (in)'])
        consecutive_snow_days.append(consecutive_snow_days[-1] + 1)
        days_since_last_snow.append(0)
    else:
        # No snow today: carry the accumulated total forward
        accumulated_snow.append(accumulated_snow[-1])
        consecutive_snow_days.append(0)
        days_since_last_snow.append(days_since_last_snow[-1] + 1)
# Drop the seed values at index 0
df_weather['consecutive_snow_days'] = consecutive_snow_days[1:]
df_weather['days_since_last_snow'] = days_since_last_snow[1:]
df_weather['accumulated_snow'] = accumulated_snow[1:]
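As a sanity check, the same bookkeeping can be exercised on a tiny synthetic table (the column names match the weather dataset; the values are made up):

```python
import pandas as pd

# Four made-up days: a dry day, two consecutive snow days, then a dry day
df_toy = pd.DataFrame({
    'Events': ['None', 'Snow', 'Snow', 'None'],
    'Snowfall (in)': [0.0, 1.5, 2.0, 0.0],
})

consecutive, since_last, accumulated = [0], [365], [0]
for _, row in df_toy.iterrows():
    if row['Events'] in ('Snow', 'Both') or row['Snowfall (in)'] > 0:
        if consecutive[-1] == 0:
            accumulated.append(float(row['Snowfall (in)']))
        else:
            accumulated.append(accumulated[-1] + row['Snowfall (in)'])
        consecutive.append(consecutive[-1] + 1)
        since_last.append(0)
    else:
        accumulated.append(accumulated[-1])
        consecutive.append(0)
        since_last.append(since_last[-1] + 1)

print(consecutive[1:])   # [0, 1, 2, 0]
print(since_last[1:])    # [366, 0, 0, 1]
print(accumulated[1:])   # [0, 1.5, 3.5, 3.5]
```

Note how the 365-day seed makes the first dry day read as "366 days since last snow", and how the accumulated total carries forward on dry days rather than resetting.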
# Load 311 dataset
df_311 = pd.read_csv('data/311.csv')
# Split the date column into separate Year, Month, and Day columns
df_311['open_dt'] = pd.to_datetime(df_311['open_dt'])
df_311['Year'] = df_311['open_dt'].apply(lambda date: date.year)
df_311['Month'] = df_311['open_dt'].apply(lambda date: date.month)
df_311['Day'] = df_311['open_dt'].apply(lambda date: date.day)
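An equivalent, typically faster alternative to the `apply` calls above is pandas' vectorized `.dt` accessor; a self-contained sketch with made-up dates:

```python
import pandas as pd

df_toy = pd.DataFrame({'open_dt': ['2015-01-27 08:30:00', '2014-11-02 13:00:00']})
df_toy['open_dt'] = pd.to_datetime(df_toy['open_dt'])
# .dt.year/.dt.month/.dt.day operate on the whole column at once
df_toy['Year'] = df_toy['open_dt'].dt.year
df_toy['Month'] = df_toy['open_dt'].dt.month
df_toy['Day'] = df_toy['open_dt'].dt.day
print(df_toy[['Year', 'Month', 'Day']].values.tolist())  # [[2015, 1, 27], [2014, 11, 2]]
```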
# Merge the 311 requests with the weather data by date
df_311_weather = pd.merge(df_311, df_weather, on=['Year', 'Month', 'Day'])
df_income = pd.read_csv('data/median income.csv', delimiter=";")
def convert_neighborhood(neighborhood):
    conversions = [("Allston / Brighton", "Allston/Brighton"),
                   ("Allston", "Allston/Brighton"),
                   ("Brighton", "Allston/Brighton"),
                   ("Back Bay", "Back Bay/Beacon Hill"),
                   ("Beacon Hill", "Back Bay/Beacon Hill"),
                   ("Fenway / Kenmore / Audubon Circle / Longwood", "Fenway/Kenmore")]
    for left, right in conversions:
        if neighborhood == left:
            return right
    return neighborhood
df_311_weather['neighborhood'] = df_311_weather['neighborhood'].apply(convert_neighborhood)
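The same normalization can be expressed as a dictionary lookup; a self-contained equivalent (same conversion table as above), checked on a couple of inputs:

```python
def convert_neighborhood(neighborhood):
    # Map 311 neighborhood names onto the income dataset's region names;
    # names without an entry pass through unchanged
    conversions = {
        "Allston / Brighton": "Allston/Brighton",
        "Allston": "Allston/Brighton",
        "Brighton": "Allston/Brighton",
        "Back Bay": "Back Bay/Beacon Hill",
        "Beacon Hill": "Back Bay/Beacon Hill",
        "Fenway / Kenmore / Audubon Circle / Longwood": "Fenway/Kenmore",
    }
    return conversions.get(neighborhood, neighborhood)

print(convert_neighborhood("Brighton"))    # Allston/Brighton
print(convert_neighborhood("Dorchester"))  # Dorchester
```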
# Clean up dollar strings into numbers
df_income['median household income'] = df_income['median household income'].apply(lambda money: Decimal(sub(r'[^\d.]', '', money)))
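The regex-based cleanup can be illustrated in isolation (the dollar strings here are invented examples):

```python
from re import sub
from decimal import Decimal

def to_decimal(money):
    # Strip everything except digits and the decimal point,
    # then parse the remainder as an exact Decimal
    return Decimal(sub(r'[^\d.]', '', money))

print(to_decimal('$56,902'))      # 56902
print(to_decimal('$101,375.50'))  # 101375.50
```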
df_311_weather_income = pd.merge(df_311_weather, df_income, left_on="neighborhood", right_on="region", how="left")
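Because the merge uses `how="left"`, every 311 row survives even when its neighborhood has no matching income region; such rows simply get NaN income. A toy illustration (the income figure is made up):

```python
import pandas as pd

left = pd.DataFrame({'neighborhood': ['Dorchester', 'Unknown']})
right = pd.DataFrame({'region': ['Dorchester'],
                      'median household income': [47200]})  # hypothetical value
merged = pd.merge(left, right, left_on='neighborhood', right_on='region', how='left')
# The unmatched 'Unknown' row is kept, with NaN income
print(merged['median household income'].isna().tolist())  # [False, True]
```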
df = df_311_weather_income.filter(['Year', 'Month', 'Day', 'case_title', 'reason', 'type', 'neighborhood', 'High Temp (F)', 'Avg Temp (F)', 'Low Temp (F)', 'High Wind (mph)', 'Avg Wind (mph)', 'High Wind Gust (mph)', 'Snowfall (in)', 'Precip (in)', 'Events', 'consecutive_snow_days', 'days_since_last_snow', 'accumulated_snow', 'median household income'])
# Mask for the winter spanning late 2014 through early 2015
winter_2014_2015 = ((df['Year'] == 2014) & (df['Month'] > 8)) | ((df['Year'] == 2015) & (df['Month'] < 5))
df
# Make sure plt is in a clean state
plt.rcdefaults()
# Convert the values in the "case_title" column to strings so that
# string methods like str.contains work even on missing titles
df["case_title"] = df["case_title"].astype(str)
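Converting to strings first matters because `str.contains` propagates missing values for non-string entries, which then breaks boolean indexing; a toy check with made-up titles:

```python
import pandas as pd

s = pd.Series(['Snow Plowing Request', None])
# After astype(str), every entry is a plain string and contains() is well-defined
s = s.astype(str)
print(s.str.contains('Snow').tolist())  # [True, False]
```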
# Function that generates bar graphs comparing snow-related 311 requests
# with actual snowfall for whichever dataset is passed in
def graph_snow_requests_vs_actual_requests(df):
    # Days with snowfall: any value greater than 0 in "Snowfall (in)"
    snowed = df["Snowfall (in)"] > 0
    # Days with no recorded snowfall
    no_snow = df["Snowfall (in)"] == 0
    # 311 reports whose case title contains the word "Snow"
    has_snow = df["case_title"].str.contains("Snow")
    # 311 reports whose case title does NOT contain the word "Snow"
    no_word = ~has_snow

    # Count reports for each combination of snowfall and whether
    # the word "snow" appears in the case title
    count_yes = df[snowed & has_snow]['case_title'].count()
    count_no = df[no_snow & has_snow]['case_title'].count()
    count_noSnowReport = df[snowed & no_word]['case_title'].count()
    count_noSnownoReport = df[no_snow & no_word]['case_title'].count()

    # First graph: reports with and without "snow" in the case title
    # on days when there WAS snowfall
    x_pos = ('"snow" exists', '"snow" doesn\'t exist')
    y_pos = np.arange(len(x_pos))
    performance = [count_yes, count_noSnowReport]
    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, x_pos)
    plt.ylabel('Number of 311 requests')
    plt.xlabel('Whether the word "snow" exists within the 311 report title')
    plt.title('Having the word "snow" in 311 requests when there WAS snowfall')
    plt.show()

    # Second graph: the same comparison for days with NO snowfall
    performance = [count_no, count_noSnownoReport]
    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, x_pos)
    plt.ylabel('Number of 311 requests')
    plt.xlabel('Whether the word "snow" exists within the 311 report title')
    plt.title('Having the word "snow" in 311 requests when there WAS NO snowfall')
    plt.show()
graph_snow_requests_vs_actual_requests(df)
Note: This dataset spans 2011 to 2018 and includes all seasons.
There are more reports without the word “snow” in the case title even on days with snowfall, and there are several plausible reasons. First, “snowfall” in our dataset is any day with more than 0 inches of snow, and an inch or so of snow does not impact the community much. Also, not all snow-related reports include the word “snow” in the title: a report might be titled “frozen road” or “request for street cleaning”, and such reports are omitted from our count because we only match titles that explicitly contain the word “snow”.
Looking at the graph, there are also more reports without the word “snow” in the case title when there is no snow. The most obvious reason is that this dataset includes all seasons from 2011 to 2018. Boston gets quite snowy during the winter months, but snow from spring to fall is rare, so a dataset covering all seasons naturally contains many more reports unrelated to snowfall.
graph_snow_requests_vs_actual_requests(df[winter_2014_2015])
Note: This dataset covers only the winter months of 2014-2015.
Unlike the bar graph above for the same question, with this dataset of winter months only we now have more 311 requests with the word “snow” in the case title. We picked the winter of 2014-2015 specifically because it saw a lot of snowfall, and probably for that reason we also see many more snow-related 311 requests. As noted above, this count omits requests that do not explicitly have the word “snow” in the case title, yet even so, reports with the word “snow” dominate. This suggests that the winter of 2014-2015 was indeed quite severe in how much the weather affected the residents of Boston.
Similar to the bar graph above for the same question, the winter-only dataset also has more 311 requests without the word “snow” in the case title on days with no snowfall. This is probably because, with no snow that day, there were fewer (or at least less severe) snow-related problems. The snow-related reports that do remain are probably due to remnants of snow from the previous day or two.
Next, we investigate a few types of extreme weather and how they may affect the kinds of 311 requests that come in: windy, snowy, and chilly days. It is interesting to see which sorts of requests occur more often under certain types of weather than others.
One visually effective way to observe trends in word choice is to generate word clouds. Below, we define which requests fit our criteria for each weather type, and then generate word clouds from the request text for those requests.
def make_wordcloud_from_text(text):
    wordcloud = WordCloud(
        width=3000,
        height=2000,
        background_color='white',
        stopwords=STOPWORDS).generate(text)
    plt.figure(
        figsize=(20, 15),
        facecolor='white',
        edgecolor='white')
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()

def make_wordcloud(df):
    # Concatenate the 'reason' and 'type' text of every request,
    # then build a wordcloud from the combined text
    text_cols = ['reason', 'type']
    make_wordcloud_from_text(" ".join(df[col_name].str.cat(sep=" ") for col_name in text_cols))

## WINDY ##
windy = df['Avg Wind (mph)'] > 30
## SNOWY ##
snowy = df['days_since_last_snow'] < 2
## CHILLY ##
celsius = df['Low Temp (F)'].apply(lambda x: (x - 32) * (5.0 / 9.0))
chilly = celsius < -20
## RAINY ##
rainy = df['Precip (in)'] > 2.5
make_wordcloud(df[rainy])
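The Fahrenheit-to-Celsius conversion behind the chilly mask can be spot-checked on a couple of reference points:

```python
def f_to_c(f):
    # Standard Fahrenheit-to-Celsius conversion
    return (f - 32) * (5.0 / 9.0)

print(f_to_c(32))  # 0.0 (freezing point)
print(f_to_c(-4))  # approximately -20, the "chilly" cutoff used above
```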
The first type of extreme weather that we showcase is the snowy case. We define this as days on which it snowed, plus the day immediately following a snow day. The biggest words are Snow Plowing Request, which fits our expectations. There are also requests to repair potholes and reports of missed trash pickups. There are many requests for Scheduled Bulk Item Pickup, which puzzles us a bit; perhaps these are reports of missed pickups.
An interesting snow-related request is "Parking Enforcement". One wonders how many of these involve spots marked with lawn chairs. More likely, since snow makes many parking spaces unusable, people are more likely to park illegally and thus trigger more parking enforcement-related 311 calls.
make_wordcloud(df[snowy])
In the wordcloud below, observe the words that come up most often on windy days (days with an average wind speed of 30 mph or higher). Tree-related requests, as well as Downed Wire reports, are very common. Street lights, traffic lights, and signal repair all suggest that Boston's traffic control infrastructure is vulnerable to high winds. There are also many requests for street cleaning.
make_wordcloud(df[windy])
Snow is only one of the challenges Bostonians face in the winter. The other one is just sheer cold. What kinds of calls do Bostonians make in subfreezing temperatures?
The biggest words by far come from the request type "Heat Excessive Insufficient", which seems to be a catch-all for heating-related issues.
make_wordcloud(df[chilly])
Here, we correlate the frequency of 311 reports with each region in Boston by counting the occurrences of each neighborhood in the "neighborhood" column of the 311 dataset.
# Data manipulation: get counts of reports by neighborhood
def plot_reports_by_region(df):
    value_counts = df["neighborhood"].value_counts()
    x_pos = np.arange(len(value_counts))
    plt.bar(x_pos, value_counts.values, align='center',
            color='green', ecolor='black')
    plt.xticks(x_pos, value_counts.index, rotation='vertical')
    plt.xlabel("neighborhood")
    plt.ylabel("frequency")
    plt.title("region correlated with frequency of 311 reports")
    plt.show()
plot_reports_by_region(df)
The bars are sorted from most to least. From the chart, we can tell that Dorchester has by far the most 311 reports. We looked up the population of each region and found that Dorchester has the largest population, followed by Roxbury and Brighton, so we suspect that the frequency of 311 reports in each region is directly related to its population.
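The most-to-least ordering in the chart comes for free: `value_counts` sorts descending by count. A toy check (the neighborhood list is illustrative):

```python
import pandas as pd

s = pd.Series(['Dorchester', 'Roxbury', 'Dorchester', 'Brighton', 'Dorchester', 'Roxbury'])
counts = s.value_counts()
# Most frequent neighborhood first
print(counts.index.tolist())   # ['Dorchester', 'Roxbury', 'Brighton']
print(counts.values.tolist())  # [3, 2, 1]
```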
After looking at which regions have the most 311 requests, we looked at the median household income of each of those regions to try to find a correlation between median household income and frequency of 311 requests.
# Data manipulation: get counts of reports by median household income
value_counts = df["median household income"].value_counts()
# Plot how many of our reports come from low- vs high-income areas
plt.scatter([int(x) for x in value_counts.keys()], value_counts.values)
plt.xlabel("income ($)")
plt.ylabel("frequency")
plt.title("income correlated with number of 311 reports")
plt.show()
We relate the median income of each region to the frequency of 311 reports in a scatter plot. Most of the frequency values cluster in the range of 2,500 to 10,000 regardless of income. We expected more reports from the relatively poor regions, but no pattern emerges from the plot, so income and the frequency of 311 reports do not appear to be directly related.
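One way to quantify the "no direct relationship" impression would be a Pearson correlation between income and report count; a self-contained sketch with made-up numbers (a value near 0 would indicate little linear relationship):

```python
import numpy as np

incomes = np.array([30000.0, 45000.0, 60000.0, 90000.0])  # hypothetical region incomes
counts = np.array([5200.0, 8700.0, 3100.0, 7600.0])       # hypothetical report counts
# Pearson correlation coefficient, always in [-1, 1]
r = np.corrcoef(incomes, counts)[0, 1]
print(-1.0 <= r <= 1.0)  # True
```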
pd.set_option('display.max_columns', None)
# Distribution of consecutive snow days among "Request for Snow Plowing" reports
plt.hist(df['consecutive_snow_days'][df['type'] == 'Request for Snow Plowing'], bins=20)
plt.show()
# Distribution of high wind gust speeds
plt.hist(df['High Wind Gust (mph)'], bins=50)
plt.show()
# What is the average days_since_last_snow for snow removal requests
# 1. filter the table for only entries about snow removal
plow_requests = df[df['case_title'].str.contains("Snow", na=False)]
# 2. for those entries, compute the average of days_since_last_snow
select_by = ((plow_requests['Year'] == 2014) & (plow_requests['Month'] > 8)) | ((plow_requests['Year'] == 2015) & (plow_requests['Month'] < 5))
print(plow_requests[select_by]['days_since_last_snow'].mean())
print(plow_requests['days_since_last_snow'].mean())
# For longer streaks of consecutive snow days, are there more requests?
# plow_requests.groupby('consecutive_snow_days')['case_title'].count()